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Abstract 

Background: Insertions and deletions (indels) account for more nucleotide differences between two related DNA 
sequences than substitutions do, and thus it is imperative to develop a method to reliably calculate the occurrence 
probabilities of sequence alignments via evolutionary processes on an entire sequence. Previously, we presented a 
perturbative formulation that facilitates the ab initio calculation of alignment probabilities under a continuous-time 
Markov model, which describes the stochastic evolution of an entire sequence via indels with quite general rate 
parameters. And we demonstrated that, under some conditions, the ab initio probability of an alignment can be 
factorized into the product of an overall factor and contributions from regions (or local alignments) delimited by 
gapless columns. 

Results: Here, using our formulation, we attempt to approximately calculate the probabilities of local alignments under 
space-homogeneous cases. First, for each of all types of local pairwise alignments (PWAs) and some typical types of 
local multiple sequence alignments (MSAs), we numerically computed the total contribution from all parsimonious 
indel histories and that from all next-parsimonious histories, and compared them. Second, for some common types of 
local PWAs, we derived two integral equation systems that can be numerically solved to give practically exact solutions. 
We compared the total parsimonious contribution with the practically exact solution for each such local PWA. Third, 
we developed an algorithm that calculates the first-approximate MSA probability by multiplying total 
parsimonious contributions from all local MSAs. Then we compared the first-approximate probability of each 
local MSA with its absolute frequency in the MSAs created via a genuine sequence evolution simulator, Dawg. In all 
these analyses, the total parsimonious contributions approximated the multiplication factors fairly well, as long as gap 
sizes and branch lengths are at most moderate. Examination of the accuracy of another indel probabilistic model in the 
light of our formulation indicated some modifications necessary for the model's accuracy improvement. 

Conclusions: At least under moderate conditions, the approximate methods can quite accurately calculate ab initio 
alignment probabilities under biologically more realistic models than before. Thus, our formulation will provide other 
indel probabilistic models with a sound reference point. 
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Background 

The evolution of DNA, RNA, and protein sequences is 
driven by mutations such as base substitutions, insertions 
and deletions (indels), recombination, and other genomic 
rearrangements (e.g., [1-3]). Thus far, analyses on sub¬ 
stitutions have predominated in the field of molecular 
evolutionary study, in particular using the probabilistic 
(or likelihood) theory of substitutions that is now 
widely accepted (e.g., [4-6]). This is probably because 
evolutionary models describing residue substitutions 
are relatively easier to handle. However, it must be re¬ 
membered that the study of indels is at least as crucial 
as the study of substitutions. There are two major reasons 
for this. First, it is indels but not substitutions that yield 
the skeletons (or the gap configurations) of the sequence 
alignments (reviewed, e.g., in [7]), which provide essential 
inputs to most homology-based analyses in computational 
biology. And second, some recent comparative genomic 
analyses have revealed that indels account for more base 
differences between the genomes of closely related species 
than substitutions (e.g., [8-12]). These circumstances 
make it imperative to develop a stochastic model that 
enables us to reliably calculate the probability of se¬ 
quence evolution via mutations including insertions 
and deletions. Since the groundbreaking works by 
Bishop and Thompson [13] and by Thorne, Kishino 
and Felsenstein [14], many studies have been done to 
develop and apply methods to calculate the probabil¬ 
ities of pairwise alignments (PWAs) and multiple se¬ 
quence alignments (MSAs) under the probabilistic 
models aiming to incorporate the effects of indels. Such 
methods have greatly improved in terms of the computa¬ 
tional efficiency and the scope of application. See excellent 
reviews for details on this topic (e.g., [15-17]). A majority 
of these studies are based on hidden Markov models 
(HMMs) (e.g., [18]) or transducer theories (e.g., [19]). Both 
of them calculate the indel component of an alignment 
probability as a product of inter-column transition prob¬ 
abilities or of block-wise contributions. And the study on 
these methods is still advancing, strengthening their 
mathematical and algorithmic bases (e.g., [20, 21]). Unfor¬ 
tunately, most of these methods have at least either of two 
fundamental problems, one regarding the evolutionary 
consistency and the other regarding the biological realism. 
(For details, see Background of [22].) Regarding the evolu¬ 
tionary consistency, it is a priori unclear whether or how a 
HMM or a transducer is related with any genuine stochas¬ 
tic evolutionary model (or “evolutionary model” for short), 
which describes the evolution of an entire sequence via 
indels along a time axis. Regarding the biological realism, 
the standard HMMs or transducers can at best handle 
mixed geometric distributions of indel lengths (e.g., [23]) 
(and usually implement simple geometric distributions), 
whereas many empirical studies showed that the real indel 


lengths are distributed according to power-laws (e.g., [24] 
and references therein). Besides, very few studies thus far 
(e.g., [25]) addressed the issue of indel rate variation across 
regions. 

In a previous study [22], we presented a theoretical 
formulation that facilitates the ab initio calculation of 
alignment probabilities under a genuine stochastic evolu¬ 
tionary model, specifically, a general continuous-time 
Markov model of sequence evolution via indels. Our 
evolutionary model was created as a result of incorporat¬ 
ing the idea of position-specific evolutionary rates [25] 
into the most general “substitution/insertion/deletion 
model” [26]. Thus, the model is naturally devoid of the 
aforementioned two problems. Aided by some techniques 
of time-dependent perturbation theory in quantum me¬ 
chanics [27-29], we formally expanded the ab initio prob¬ 
ability of an alignment into a series of terms with different 
numbers of indels. This expansion gave an intuitively 
clearer representation of Fellers theorems [30]. And it 
theoretically underpinned the stochastic evolutionary 
simulation method of Gillespie [31], which provides the 
foundation for genuine sequence evolution simulators 
(e.g., [32-34]). And we also showed that, if the indel 
model parameters satisfy a certain set of conditions, the 
ab initio probability of an alignment is indeed factorable 
into the product of an overall factor and contributions 
from local alignments delimited by preserved ancestral 
sites (PASs), i.e., gapless columns. This suggested that the 
evolutionary models satisfying such conditions could pro¬ 
vide a sort of generalized HMMs, which extend the space- 
and time-homogeneous “long indel” model [26] to some 
space- and time-heterogeneous situations. 

In this paper, we focus on how to concretely calculate 
the contribution from each local alignment, assuming that 
the indel model satisfies the conditions for factorable 
alignment probabilities. (To clearly illustrate the concrete 
computations, we deal with space-homogeneous models 
in the bulk of this manuscript (i.e., in sections R2-R6), and 
briefly discuss extensions to more general cases near the 
end (i.e., in subsection R7.1).) As noted in [26] and section 
R1 of Results and discussion of this manuscript, the con¬ 
tribution from each local alignment is a summation over 
an infinite number of local indel histories. Thus it cannot 
be computed literally exactly within a finite amount of 
time. This makes it necessary to devise some approxima¬ 
tion methods, each of which sums contributions from a fi¬ 
nite number of indel histories (as first proposed in [26]). 
An auspice is that indel rates (say, X ID indels per site per 
unit time) are known to be at most around 1/10 of the 
substitution rates (say, X s substitutions per site per unit 
time) (e.g., [24, 35]). And the probability of an indel his¬ 
tory involving N ID indels is roughly O (i.X ID t) NlD ^j times 
the probability of a history with no indel, where t is a time 
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scale characteristic of the system under consideration. In 
conjunction, these suggest that taking account only of 
histories with minimum and near-minimum required 
numbers of indels may provide a good approximation 
to each local alignment probability, as long as the se¬ 
quence divergences (e.g., X s t) are within the scope of 
phylogenetic analyses (i.e., at most 0(1) substitutions 
per site). 

In section R1 of Results and discussion, we briefly re¬ 
view the relevant portion of the theoretical basis that 
was established in our previous study [22]. We introduce 
simplified notation so that we can focus on a single local 
alignment. In sections R2-R4, we demonstrate how our 
perturbative formulation can be concretely used to ap¬ 
proximately calculate the contributions to the ab initio 
alignment probabilities from local alignments, i.e., align¬ 
ment regions separated by gapless columns. We examine 
all types of local pairwise alignments (PWAs) in section 
R2, and some typical types of local multiple sequence 
alignments (MSAs) in section R4. For each local align¬ 
ment type, we calculate the total parsimonious contribu¬ 
tion and the total next-parsimonious contribution to its 
probability (more precisely, its multiplication factor). In 
section R3, we discuss two systems of integral equations 
that can be numerically solved to give practically exact 
solutions (or “exact” solutions, for short) for some com¬ 
mon types of local PWAs. There, we also study the be¬ 
haviors of the “exact” solutions. Then, by comparing the 
total parsimonious contribution with the total next- 
parsimonious contribution, or with the “exact” solution, 
we investigate the parameter regions in which the total 
parsimonious contribution can approximate the align¬ 
ment probability quite accurately (in sections R2-R4). In 
section R5, we perform simulation analyses with a genuine 
evolutionary simulator, Dawg [32], to examine whether or 
not the conclusions from sections R2-R4 also apply to 
local MSAs of more general types. For this purpose, we 
developed an algorithm to calculate the “first-approxi¬ 
mate” probability of a given MSA under a given parameter 
setting (including a given tree) by multiplying the overall 
factor and the total parsimonious contributions from all 
local MSAs. And we examine the accuracy of the first- 
approximate multiplication factors calculated by the algo¬ 
rithm. In section R6, we use our ab initio formulation as a 
“yardstick” to measure the accuracy of other indel prob¬ 
abilistic models. As a representative model, we chose the 
generalized HMM of [36], which aims for the biological 
realism but not fully for the evolutionary consistency. In 
section R7, we discuss some outstanding issues and 
possible improvements, extensions and applications of 
the presented algorithm and methods. The topics in¬ 
clude the risks associated with the naive application of 
our algorithm or methods to reconstructed alignments. 
The sections in Methods describe the settings for numerical 


analyses (Ml) and simulation analyses (M2). And the sec¬ 
tions in Supplementary methods in Additional file 1 explain 
methodological details on concrete perturbation calcula¬ 
tions and the first-approximate algorithm. 

This paper basically uses the same conventions and 
notations as used in [22]. Briefly, a sequence state s (e S 11 ) 
is represented as an array of sites, each of which is 
equipped with an ancestry index (v x e T). 1 (In this study, 
we focus on indels. Hence, we do not consider the residue 
states of sequences. For the incorporation of residue states 
and substitutions, see, e.g., [37, 38].) And each indel event 
is represented as an operator acting on the bra-vector, (s|, 
representing a sequence state. More specifically, the oper¬ 
ator Mj(x , /) denotes the insertion of / sites between the 
x-th. and (x + 1) th sites, and the operator M D {x B , Xe) de¬ 
notes the deletion of a sub-array between (and including) 
the x B -Xh and the x £ -th sites. Readers unfamiliar with the 
bra-ket notation (as adapted from theoretical physics (e.g., 
[27, 28])) can simply regard a bra-vector ((s|), a ket-vector 
(| s')) and an operator (M) as convenient reminders of a 
row vector, a column vector and a matrix, respectively, just 
as in the standard representation of a continuous-time 
Markov model. (See section SA-1 in Additional file 2 of 
[22] for the equivalence between them.) And, also as in 
[22], the following terminology is used. The term “an indel 
process” means a series of successive indel events with both 
the order and the timing specified. And the term “an indel 
history” means a series of successive indel events with only 
the order specified. 

As a last note, an “alignment,” a “PWA” and a “MSA” 
in this paper will mean their homology structures [39]. 
Briefly, the homology structure of an alignment is a set 
of alignment columns (i.e., sets of homologous sites in 
the aligned sequences) that are spatially arranged in a 
looser way than in a usual alignment, i.e., constrained 
only by the spatial relationships between the sites within 
each aligned sequence. 2 

Results and discussion 

[Descriptions given here are somewhat sketchy. For 
methodological details, as well as the relationship with 
the results of [22], see the relevant sections of Supple¬ 
mentary methods in Additional file 1.] 

R1. Perturbation expansion of multiplication factor for 
local alignment 

In this section, we briefly explain some results in [22] 
that are essential for this paper. Similarly to that of the 
probabilistic alignment methods in general, one of the 
main goals of our theoretical formulation (presented in 
[22]) is to calculate the absolute occurrence probabilities 
of the alignments and to compare the calculated align¬ 
ment probabilities. Therefore, unless stated otherwise, 
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the probabilities considered in this paper are not condi¬ 
tioned on a particular alignment (or even on extant se¬ 
quences). (Once the absolute probabilities are calculated, 
such conditional probabilities (e.g., of indel histories) 
could be obtained by dividing the absolute probabilities 
of the outcomes (e.g., the indel histories) by the absolute 
probability of the condition (e.g., the resulting align¬ 
ment), similarly to Eq. (SM-5.3.6a) in Additional file 1.) 

Let P[{a(^ t s D ), [t h t F \) | (Z 1 , t /)] be the probability 
that a PWA (n(s A , s D )) between an ancestral sequence 
state (s A ) and a descendant ( s D ) result from the evolu¬ 
tion of a sequence during a time interval ([ t b t F ]), 
given s A at t b In [22], we formally showed that 
P[(a(s A , s D ), [t b t F \) | (s A , £/)] is given as a series: 

P[(a:(AA), [t h t F ]) | (A f/)] 

I K. >-)] 

(1) 

Here, N^ais*, s°)\ is the minimum number of indels 
required to create tf(/\s D ). And P^Ka^yS 0 ), [t b t F }) \ 
(A, tj)] is the fraction of P[(a(s A ,s D ), [t h t F ]) \ (s A , tj)] 
contributed from all N-event indel histories that can yield 
n(/\ s D ). A “preserved ancestral site” (PAS) is a site of 5 A 
that was hit by no indel and thus was preserved all 
through [t b t F ]. Now, using some (but not necessarily all) 
PASs, we partition a(A, s D ) into “local regions” (i.e., inter- 
PAS regions), y l5 y 2 , ..., y , in which all potentially 
causative indels are confined. In [22], we derived the two 
conditions. 

Condition (i): Each indel rate parameter is independent 
of the portion of the sequence state outside of the local 
region where the indel occurred. 

Condition (ii): The increment of the exit rate due to 
each indel event is independent of the portion of the 
sequence state outside of the local region where the 
indel occurred. (The “exit rate” of a state is the rate at 
which the system “exits” the state, that is, the total 
rate at which the state changes to any of other states.) 

Under these conditions, the PWA probability, Eq. (1), 
can be factorized as: 

P [(«($*, S°), [tp t F ]) | (s^fj)] 

= p [([], [tht F _]) | (A */)] fl &»!/*; MA A)> Mf]) | (A u )] 

K= 1 

(2) 

Here P[([], [t b t F \) | (. s A , tj)\ is the probability that 
the sequence underwent no indels during [t b t F ], given 
/at t h And ji P [y K ; («(/, s°), [ti,t F ]) | (/, f,)] is 
the multiplication factor contributed from the local re¬ 
gion, y K . Because the multiplication factor is a summation 


of contributions over all local indel histories that can yield 
the local PWA confined in y K , it can also be expressed as a 
series similar to Eq. (1): 

fair* («(A A’ I^Tf]) | (^, k)] 

= s °)’ I ( sA> 

( 3 ) 

Here, , sP); y K ] is the minimum number of 

indels required for the portion of n(A, 5 D ) in y K . And the 
term (A P (A0 [y /c ; (n(/, 5°), fe^]) | (A, £/)] is the por¬ 
tion of the multiplication factor contributed from all 
local-PWA-consistent A^-indel local histories in y K . 3 (For 
more details, see the first half of SM-1 of Supplementary 
methods in Additional file 1.) 

It should be noted that the multiplication factor, 

(e.g., in Eq. (2)), is not a probability; actually, it is not 
even a conditional probability, and it can exceed 1 
(unity) in some cases (in such manners that the entire 
right hand side of Eq. (2) will always be less than 1 
(unity)). In this sense, the “generalized HMM” given by 
Eq. (2) differs from normal HMMs. The contribution of 
each local indel history to a multiplication factor is the 
ratio of the probability of the history (given an ancestral 
state) to the probability that the ancestral state under¬ 
went no indel. (See Eq. (SM-1.7) in Additional file 1 for 
the mathematical definition.) When comparing the con¬ 
tributions from two different sets of histories (potentially 
giving rise to the same local alignment), the denomin¬ 
ator (i.e., the probability of no indel) is usually identical. 
Therefore, in general, the comparison of two multiplica¬ 
tion factor contributions gives the same result as the 
comparison of the corresponding probabilities. (Similar 
notes apply also to the analyses of local MSAs below.) 

Similar arguments hold also for the probability, P 
[a[si, 52,..., s N x] I T\, that a MSA {a[s u 5 2 ,..., s N x]) of 
N x sequences, 5i, 5 2 , ..., s N x, results from the evolution 
along a given phylogenetic tree ( T) [22]. Basically in line 
with the idea in [18, 19, 40], we can build up the prob¬ 
ability of a MSA, first by multiplying the root state prob¬ 
ability and the probabilities of ancestor-descendant 
PWAs along branches, and second by summing such 
products over all MSA-consistent ancestral states. The ab 
initio MSA probability thus composed can be expressed as 
a series: 

P[a[s u s 2 ,..., s^x] | T] (4) 

= TT=AT m /(F>[ a [ Sl ’ S 2<-> %*] \ T \ 

Here, N min is the minimum number of indels required 
for creating the MSA. (For simplicity, we omitted the 
obvious dependence of A min on the MSA and the tree.) 
And P( N )[a[s\ y 5 2 ,..., s N x] \T] is the portion of the 


Ezawa BMC Bioinformatics (2016) 17:397 


Page 5 of 21 


probability contributed from all MSA-consistent M-event 
indel histories. A MSA-counterpart of a PAS is a gapless 
column, which indicates that the corresponding site was 
hit by no indel throughout the evolution along T. (Here¬ 
after, a gapless column in a MSA is also called a “PAS.”) 
Using some PASs, we partition the MSA into local re¬ 
gions, Ci, C 2 , Ci< max • Meanwhile, there are infin¬ 
itely many possible root sequence states (. s Root ' s) 
consistent with the MSA. Among them, we choose one 
as the “reference” root state ( s Root ). Then, in addition to 
the aforementioned conditions (i) and (ii), we impose 
the following condition. 

Condition (iii): the (prior) probability of each root 
state (s Root ) is given by the probability of s Root multiplied 
by the product of factors over the local regions, where 
each factor depends only on the difference between s Root 
and s Root in a local region. 

Under these conditions, the MSA probability is factor¬ 
ized as: 

P[a[si, 52 ,.*., s N x] I T] 

Kmax ~ 

= P 0 [4 oot I T] nM P [a[ Sl ,5 2 ,...,s^]; sj 0 *; C K | T ] 

K =1 

(5) 

Here, P 0 [s Root \ T\ is the probability that the root se¬ 
quence state is s Root and that it was hit by no indel all 

across T. And M P [n[si, s 2 , • s N x]; s Root ; Ck \ T] is 

the multiplication factor contributed from the local re¬ 
gion, C K . As in Eq. (4), the multiplication factor also can 
be expressed as a series: 

M P [a[s u s 2 ,..., s N x]-, s* oot ; C K \ T] 

= YZ=N mi „[C K ]Mp(N)[ a l S l’ S2 >-> %*]? S 0°° t ’ C K \ T] 

( 6 ) 

Here, N min [C K ] is the minimum number of indels re¬ 
quired for the portion of the MSA in C K . And the term 
M P (n) [a[si, 5 2 ,..., s N x]; s Root ; C K \T] is the fraction 
of the multiplication factor contributed from all local- 
MSA-consistent AMndel local histories in C K . (For more 
details, see the second half of SM-1 of Supplementary 
methods in Additional file 1.) 

As agued above, under conditions (i) and (ii) (and, in 
addition, (iii) for a MSA), the probability of a given 
alignment is factorized into the product of an overall 
factor and local contributions (as in Eq. (2) and Eq. (5)). 
This factorization could drastically speed up the com¬ 
putation of the probability. However, each local contri¬ 
bution is still a summation over an infinite number of 


indel histories (Eq. (3) and Eq. (6)), and its literally 
exact calculation would take infinitely long. This study 
examines two kinds of approximation methods. One 
is the “first approximation,” which approximates 
each multiplication factor with the “total parsimoni¬ 
ous contribution,” i.e., the summation of contribu¬ 
tions over all possible parsimonious indel histories. 
(It corresponds to the term, ju P (A ^[y K ; (a(s A , s D ), [t h 
t F \) I (s A , t,)} with N = N min [a(s A ,s D \, y K ] or M P (N) 
[a[si, 5 2 , ..., s N x]; Sg 00t ; C k \T] with N = N min [C K \) 
And the other is calculating a practically exact solution 
(or an “exact” solution, for short) of each local contribu¬ 
tion from a local PWA of a certain type (see section R3). 
Especially, we examine the accuracy of the first approxi¬ 
mation by comparing the total parsimonious contribution 
either with the total next-parsimonious contribution (sec¬ 
tions R2 and R4), with the “exact solution” (section R3) or 
with the results of simulations (section R5). Here, the 
“total next-parsimonous contribution” is the summation 
of contributions over all next-parsimonious indel histories. 
(It usually corresponds to the above term with 
N = A/ m<w [a(s A ,s D ); y K ] + lor with N = N min [C K ] + 1). 

In the following sections, we will work with a 
model that satisfies the conditions (i), (ii) and (iii), 
and we will focus on calculating the multiplication 
factor that comes from a single local region (i.e., a 
“local alignment”) flanked by a pair of PASs. As in 
[22], we will work in the state space S H . This means 
that we will calculate the probability of the homology 
structure of each local alignment (e.g., [39]). Let 
AL(s) be the number of sites that a sequence seS 11 
has between the pair of PASs. We will re-assign the 
site numbers so that the left- and right-flanking 
PASs are numbered 1 and AL(s) + 2, respectively, and 
the sites in between them are numbered 2, ..., AL(s) 
+ 1. This will make it easy to apply our theoretical 
formulation [22] to the current situation. We will 
re-assign the ancestries v(l)=L and v(AL(s) + 2) = R 
to the left- and right-flanking PASs, respectively. 
(See endnote (1) for a brief description of the 
ancestry.) And we will usually (but not always) re¬ 
assign the ancestries v(x) = x - 1 to the sites in be¬ 
tween the PASs, x = 2,..., AL(s) + 1, of the ancestral 
sequence, s = s A (for a PWA), or the root sequence, 
s = s Root (for a MSA). See Fig. 1 for an illustration. 

Hereafter, we will often employ shorthand notations 
for the aforementioned (fractions of) multiplication fac- 
tors, e.g., pp(M)[y K ; (a(s A , s°), [t h t F ]) | (/, f,)] for a 
PWA and M P(JV) [a[si, s 2 ,..., s N x]- s R 0 oot ; C K \ T] for a 
MSA, either by omitting the arguments (like Vp(ao” an d 
“M P (tv) ”) or by replacing the arguments with simpler ones 
representing more concrete situations (like “ftpplease (ii); 
AL a ]” or “ M P ( N ^[case (II); AL DVX \ ”). Unless stated 
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a Original local MSA (in dashed box) 


b Reassigning ancestries 
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Fig. i Notation and re-numbering of sites typical in this study, a An example MSA. The C K 's (with k = 1,2,..., 10) label the regions that actually 
or potentially accommodate local indel histories (marked by bottom curly brackets and yellow wedges, respectively). As an illustration, we choose 
the local MSA confined in the region C 4 (the portion in the green dashed box), and re-assign the ancestry indices (in the cells) as shown in panel 
(b). Ancestries L and R were newly assigned to the flanking PASs. The ancestry indices in between the PASs are just examples and not always 
assigned in this way. (The indices in panel (a) could be regarded as hexadecimal numerals, if preferred.) c Subsequences extracted from the local 
MSA. Shown above each site (i.e., each cell) is the site number (i.e., spatial coordinate) re-assigned to it. And the A L(s) on the right of each 
sequence (s = s 1f ..., s 6 or s Root ) is the count of sites in between the PASs. In panels a and b, the boldface characters in the leftmost columns 
stand for the sequence IDs. (More precisely, the number 7 stands for sequence s„ and the 7?' stands for the root sequence ( s Root ).) A dash ('-') in a 
cell represents a gap. In panel b, triple dots ('...') in a cell indicate that the alignment continues outwards. Panel a was adapted from Figure S2 b 
of [22], Panels b and c were adapted from Figure 1 of [43] 


otherwise, we consider the sequence evolution during time 
interval [t b t F ] (for a PWA) or along a given tree, T (for a 
MSA). 

For illustration, we will use the indel evolutionary 
model of Dawg [32], though the analyses could be ex¬ 
tended with due modifications to more general models 
(discussed in subsection R7.1). 4 Its indel rates are space- 
homogeneous and time-homogeneous, and they are 
parametrized as follows. Let L(s) be the length of the 
sequence with state s. The rate of insertion Mj(x , /) is: 

r/(#,/;s,£) = XjfjO) (for x = 0,1, ...,L(s); l = 1,2, 

(?) 


Here, Xj is the total insertion rate (per site),^(/) is the 
insertion length distribution, and Lf° is the cut-off in¬ 
sertion length. The rate of deletion Md(x b , x e ) is: 

r D {x B ,x E ; s, t) = X D f D {l = x E -x B + 1) 

(for #b<L(s); x e > 1\ l m 1,2, 

(8) 

Here, X D is the total deletion rate (per site),^,(/) is the 
deletion length distribution, and L%° is the cut-off dele¬ 
tion length. Consequently, the exit rate from state s is: 
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(= m) 

Z-^x B =-i 


lM X *’ Xe ’ S -*)) 


^x b =-L^° +2 Z-^x E =max{x B , 1} 


— (A/ + Xd)L{s) + ^ DflM/<? [A/,Ai),^)(.)]- 


(9) 


Here, zl D ^[A/, XdJ d (-)\ = A/ + (//>—l)Ax> is a “uni- 

E l co 

1 = 1 1 t) is the 

average deletion length [32]. In this study, we use the 
power-law indel length distribution: fj(l) = / D (/) 

^ =1 /c- 16 , which is among the typical ones 

empirically observed (e.g., [24] and references therein). 
We also set A/ = A/) according to a genome-wide data 
analysis [41], unless otherwise stated. As for the se¬ 
quence state probabilities at the root, we assume a uni¬ 
form sequence length distribution hereafter. 5 See 
sections Ml and M2 of Methods for more specific 
settings. 


R2. Numerical comparison between parsimonious and 
next-parsimonious contributions (1): for local PWAs 

Here, we examine how accurately the first approxima¬ 
tion will estimate the multiplication factor from each 
local PWA by comparing the total parsimonious contri¬ 
bution with the total next-parsimonious contribution, 
both calculated via numerical computations of their ana¬ 
lytical expressions (given in SM-2 of Supplementary 
methods in Additional file 1). In this study, we are con¬ 
cerned only with the homology structures [39] of align¬ 
ments. Hence, local PWAs flanked by a pair of conserved 
ancestral sites (PASs) can be broadly classified into four 
cases, according to the sites between the PASs: (i) no ances¬ 
tral or descendant sites (panel a of Additional file 1: Figure 
SI); (ii) some (AL a > 0) ancestral sites but no descendant 
sites (panel b); (iii) some (AL D > 0) descendant sites but no 
ancestral sites (panel c); and (iv) some (AL a > 0) ancestral 
sites and some (AlP > 0) descendant sites, but with no 
ancestor-descendant homology (panel d). 6 (See Additional 
file 1: Figure S2 for parsimonious and next-parsimonious 
indel histories in case (ii), and Figure S3 for parsimonious 
histories in case (iv).) Our numerical analyses indicated the 
following. In case (i), the total next-parsimonious contribu¬ 
tion ([Appease (/)]) was negligibly smaller than the total 
parsimonious contribution (piplease (i)] (=1)) for any 
realistic situation we likely encounter, as far as a single 
inter-PAS region is concerned. In case (ii) and case (iii), the 
total next-parsimonious contribution (ptp^case (ii); AL a ] 
or pt P( 2 )[case (iii); AL D ]) amounted to 1/2 of the total 
parsimonious contribution (ftplease (ii); AL a ] or 
ft P (i)[case (iii); AL°]), when the size of the local PWA (i.e., 
AL a or AlP) is equal to a threshold value, (AL)wP ~ 1.2/ 
E[viidI (Additional file 1: Table SI and Figure S4). Here 


E[n ID ] (=(A/ + X D )(t F - £/)) is the expected number of 
indels per site during the sequence evolution. For example, 
in typical analyses of neutral genomic sequences from eu- 
therian mammals, the branch length is around 0.2 expected 
substitutions per site (e.g., [42]). And the total indel rate 
was estimated as 1/8 of the total substitution rate [35]. 
Using these values, E[n ID \ is approximately 0.2/8 = 0.025, 
which gives the threshold (AL)^ roughly equal to 50 sites. 
For the analyses of more closely related sequences, the 
threshold becomes longer. For example, in a comparison 
between primate sequences, a typical branch length would 
be 0.05 expected substitutions per site (e.g., [42]). Then, 
(AL)Pf } would be roughly equal to 200 sites. In case (iv), 
the total next-parsimonious contribution (ftp^[case (iv); 
AL a > AlP ]) did not substantially exceed 1/2 of the total 
parsimonious contribution (ftplease (iv); AL a , AL d ]) 
until the local PWA or the time interval became quite long 
(Table 1). For more details on these analyses, see SM-2 of 
Supplementary methods in Additional file 1. (Further de¬ 
tails on the calculations for cases (iii) and (iv) are given in 
sections Al.l and A1.2, respectively, in [43].) 

R3. Numerical comparison between parsimonious 
contribution and "exact solution" for local PWAs 

It is difficult to calculate the summed contributions from 
local histories involving more indels, especially in case (iv). 
We could exactly calculate the contribution from a single 
local history involving any number of indels if we use the 
algorithm for a “trajectory likelihood” given by Miklos et 
al. [26]. As we exemplified in Appendix A1.2 of [43], how¬ 
ever, it is already quite hard to enumerate even all the pos¬ 
sible next-parsimonious local indel histories for case (iv). 
Nevertheless, if we consider only cases (i), (ii), and (iii) 
under a (locally) space-homogeneous model, we can work 
out systems of exact integral equations that could in 
principle provide the numerical solutions for the total sum 
of contributions up to a desired level of accuracy, i.e., 

Pp{n ) with a desired upper-bound indel 

count (N id ), at the expense of some time and memory. 

Applying the fundamental defining integral equations 
of our evolutionary model (Eqs.(R4.4, R4.5) in [22]) to 
the local MS As of cases (i), (ii) and (iii), two systems of 
integral equations can be derived. One system is for 
cases (i) and (ii) (see SM-3 of Supplementary methods in 
Additional file 1), and the other is for cases (i) and (iii) 
(described in Appendix A1.3 of [43]). 7 These systems of 
integral equations can be numerically solved by iteration, 
and the results after N ID iterations give the aforemen¬ 
tioned jt^p 10 ' 1 (see SM-3 of Supplementary methods). A 
naive implementation of this iteration (based on Eq. 
(SM-3.2’) in Additional file 1) would be very slow, with 
the time complexity of 0(N ID (L co ) 2 (N P ) 2 ). Here L co is 
the cut-off indel length and N P is the number of equal- 


Ezawa BMC Bioinformatics (2016) 17:397 


Page 8 of 21 


Table 1 Perturbation analysis on local PWA probabilities in 
case (iv) 


{AL a , a L°) 

0.01 indels/ 
site 

0.04 indels/ 
site 

0.1 indels/ 
site 

0.2 indels/ 
site 

(1,1) 

0.003 

0.010 

0.024 

0.045 

(3, 1) 

0.021 

0.084 

0.204 

0.393 

(3, 3) 

0.042 

0.166 

0.402 

0.768 

(5, 5) 

0.073 

0.283 

0.672 

1.256 

(10, 1) 

0.064 

0.246 

0.572 

1.013 

(10, 10) 

0.149 

0.561 

1.292 

2.288 

(25, 1) 

0.151 

0.547 

1.112 

1.541 

(25, 4) 

0.198 

0.723 

1.519 

2.234 

(30, 10) 

0.288 

1.038 

2.164 

3.072 

(100, 1) 

0.537 

1.333 

1.507 

1.574 

(100, 3) 

0.607 

1.593 

1.894 

2.033 

(300, 1) 

1.165 

1.394 

1.427 

1.527 


Each cell shows the ratio of the total next-parsimonious contribution to the total 
parsimonious contribution, when there are A i A ancestral sites and A L° descend¬ 
ant sites in between the PASs. Each column is labeled with the expected number 
of indels per site ((A/+A D )(f F - f,)). See section Ml of Methods for the parameter 
setting. Because of the symmetry between probabilities under the time reversal, 
the ratio for ( AL a , AL°) = (L v L 2 ) is identical to that for ( AL a , AL°) = ( L 2 , O 
when A/ = A D . Thus we only showed the results for AL a > AL°. The ratios that are 
less than 0.5 are shown in boldface. This table is identical to Table 2 of [43] 

sized sub-time-intervals introduced for the numerical 
time integration. However, we devised a faster algorithm 
for this iteration (based on Eqs.(SM-3.4a,b) in Additional 
file 1), with the time complexity of 0(N 1D L co (L co + 
Np)N P ). And we implemented it in Perl (available in 
Additional file 2). Typically, the computation finished in 
the order of an hour (using a single thread) in a Macin¬ 
tosh computer with two quad-core 2.26 GHz Intel Xeon 
processors and 8GB memory. One round of the compu¬ 
tation provides the multiplication factors for all local 
PWA sizes {AL = 0, 1,..., L co ) and for time-interval 
sizes of k(tp- t^/Np with k = 1, 2,..., N P . Our numer¬ 
ical analyses confirmed that N ID = 200 would be enough 
to give the practically exact (or “exact”) solution for the 
local PWAs of 300 sites or less in likely situations of 
phylogenetic-level sequence analyses (data not shown). 
Thus, we used the results of N ID = 200 iterations as the 
“exact” multiplication factors, and compared the parsi¬ 
monious contributions with them (Fig. 2). We define an¬ 
other threshold value, (AL^ol, at which the parsimonious 
contribution becomes 1/2 of the “exact” solution. The 
results indicated (AL)q^ ~ 1.6 /E[ni D \ (Additional file 1: 
Table SI). Thus, (AL)ol is approximately 4/3 of (AVj^P 
in the previous section, implying that (AVj^P actually 
gives a somewhat conservative criterion for the goodness 
of the first approximation. A fringe benefit of these iter¬ 
ation analyses is that we can also assess the “w-th ap¬ 
proximation,” which is given by ftp (Fig. 2). We define 
{AL)^l as the local PWA size at which the n- th 


approximation becomes 1/2 of the “exact” solution. It 
seemed that {AL)$ > nx {AL)Pl in general (Additional 
file 1: Table SI). This suggests the benefit of incorporat¬ 
ing non-parsimonious local indel histories, especially 
when we deal with long local PWAs resulting from a 
long time evolution. 

Now that we have “exact” probabilities for local PWAs 
of cases (i), (ii), and (iii), it would be interesting to exam¬ 
ine their behaviors. Panel a of Fig. 3 shows the log-log 
plots of exact solutions for different time intervals (in 
units of the expected number of indels per site). We see 
that even finite-time transition probabilities are well ap¬ 
proximated by the power-law, with very high correlation 
coefficients for the log-log plots (0.9998 or more in the 
absolute value, see Additional file 1: Table S2). And 
panel b of Fig. 3 indicates that, as the time interval in¬ 
creases, the power-law exponent deviates gradually (yet 
only slightly) from its value for the instantaneous indel 
rates (1.6 here). Meanwhile, the coefficient seems almost 
proportional to the time interval (panel c). The slopes 
for these quantities differed for different values of the ra¬ 
tio, Aj'.Ap). (Additional file 1: Table S2 gives also their 
numerical values for some representative cases.) These 
results may be useful for future data analyses on indels, 
e.g., when inferring the power-law exponents for the 
indel rate parameters from the comparison of relatively 
divergent homologous sequences. 

Similarly to our ab initio formulation itself, these 
systems of integral equations can accommodate any 
practical indel length distributions. Therefore, we could 
even examine cases where insertions and deletions follow 
different length distributions and/or models that incorpor¬ 
ate transposon insertions (e.g., [44, 45]) as well. Such ana¬ 
lyses should be interesting and important. 

R4. Numerical comparison between parsimonious and 
next-parsimonious contributions (2): for local MSAs 

We next studied some typical cases of local MSAs. 
(The analytical calculations of the multiplication factors 
are detailed in SM-4 of Supplementary methods in 
Additional file 1.) We only examined MSAs resulting 
from the evolution along a 3-OTU tree (Fig. 4a). This is 
because a next-parsimonious indel history typically dif¬ 
fers from its parsimonious counterpart in the sequence 
state at the internal node that phylogenetically delimits 
an indel event. We examined the following four cases, 
which differ in the sets of homologous sites in between 
the PASs: (I) none of the three sequences has any site 
(Fig. 4b); (II) two sequences share a homologous run of 
sites, but the third sequence has no site (Fig. 4c); (III) 
one sequence has a run of sites, but the other two se¬ 
quences have no site (Fig. 4d); and (IV) one sequence (si) 
has a run of sites, another sequence (s 3 ) has no site, and 
yet another sequence (s 2 ) shares the homologous sites 
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Fig. 2 Iterative solutions of integral equation system for local PWA probabilities. The results shown in this figure apply to both case (ii) and case 
(iii) local PWAs. In both panels, the abscissa is the number of sites between PASs (i.e., AL a for case (ii) and A/° for case (iii)), with zero corresponding to case 
(i). Each panel shows the ratio of the total contribution from histories with up to (and including) N D indels to the "exact solution" of the multiplication 
factor. See section Ml of Methods for the parameter setting. Panel a is for 0.04 expected indels per site, and panel b is for 0.2 expected indels per site. 

Panels a and b in this figure are modified versions of panels B and C, respectively, of Figure 3 of [43] 

v j 


with si except a contiguous subset of sites it lacks (Fig. 4e). 
In case (I), similarly to case (i) local PWAs, the total 
next-parsimonious contribution ( M P ^\case (/)]) was 
negligibly smaller than the total parsimonious contribution 
(M P ^[case (/)]). The comparison in case (II) reduces to 
that in case (ii) local PWAs, thanks to the phylogenetic 
consistency condition that the ancestral sequence 
states must satisfy (e.g., [46, 47]). The next- 

parsimonious local indel histories in case (III) are 
classified into two broad types: (A) those that have the 
same ancestral sequence state as the parsimonious 
history, and (B) those that have different ancestral states 
than the parsimonious history. The comparison of the total 
parsimonious contribution ( M P ^ [case (III); AL m ~\ ) 
to the total contribution from type (A) histories 
(Mp( 2 ) [case (III); ( A ); AL m ]) reduces to the compari¬ 
son in case (iii) local PWAs. Thus, we can focus on the 
total contribution from type (B) histories 
(Mp( 2 ) [case (III); (B); AL mi \). Our numerical analyses 
showed that this contribution is much smaller than the 
parsimonious contribution (Fig. 5), even if the branch or 
the local MSA is quite long. Actually, the relative contri¬ 
bution decreased as the local MSA got longer (Fig. 5). In 
case (IV), the total next-parsimonious contribution 
(Afp( 3 ) [case (IV); AL m , AL m ]) did not substantially 
exceed 1/2 of the total parsimonious contribution 
(M P ( 2 ) [case (IV); AL m , AT 02 ]) until the local MSA or 
the branches became quite long (Table 2). 8 

Taken together, the results in sections R2-R4 suggest 
that the first approximation by the parsimonious indel 
histories alone will estimate the multiplication factor for 


each local alignment fairly well, as long as the local 
alignment size and the branch lengths (or the time inter¬ 
val) are at most moderate. 

R5. Simulation analyses to see goodness of first 
approximation for local MSAs 

Thus far, we examined all cases of local PWAs and some 
typical cases of local MSAs. To study a much wider 
variety of local MSAs, we developed an algorithm that 
calculates the first approximation of the probability of a 
given MSA under a given parameter setting including a 
phylogenetic tree. Briefly, the algorithm first chops the 
input MSA into gapped and gapless segments. Second, it 
attempts to enumerate all parsimonious indel histories 
that can give rise to each gapped segment (i.e. local 
MSA) via what we call a “local multi-path downhill 
search” algorithm. Third, it computes their contributions 
to the multiplication factor for each gapped segment. And 
finally, it computes the first-approximate MSA probability 
as the product of an overall factor and the total parsimoni¬ 
ous contributions to the multiplication factors from all 
gapped segments. (For details on the algorithm, see SM-5 
of Supplementary methods in Additional file 1, as well as 
Additional file 1: Figures S5-S8.) 

After manually validating the sub-algorithm to enumer¬ 
ate all parsimonious indel histories (described in [48]), we 
conducted simulation analyses using Dawg [32]. (See 
section M2 of Methods for the settings of the simula¬ 
tions.) We created five homogeneous sets of simulated 
MSAs, namely, sets 1A, IB, 3P, 3M and 3F. Each of sets 
1A and IB consists of 100,000 MSAs simulated along 
a three-OTU tree that has equally long branches and 
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3 Multiplication factor (for A 7 : X D = 1:1) 


-l ”0.001 indels/site “^O.Ol indels/site 0.04 indels/site 



b Power-law exponent C Coefficient 



Fig. 3 Power-law behaviors of "exact" multiplication factors from case (iii) local PWAs. a Log-log plots of the "exact" multiplication 
factors (pp W ®* 200 ^ [AL\, ordinate) against the local PWA size (A L, abscissa), showing nearly perfect power-law behaviors. Although this panel 
shows the results under A/: X D =1:1 only, the power-law approximation is actually very good also under X,: X D = 1:3andA/:A 0 = 3:1 
(Additional file 1: Table S2). Panels b and c show the power-law exponent (y) and the coefficient (A), respectively, as functions of the 
distance ((A, + X D )(t- ti) indels/site, abscissa) and the rate ratio (A/:A 0 , different curves). Here, we assumed the approximate power-law 
relation, ^q^ 200 ) [aa] ~ A(AL)~ Y . (See Additional file 1: Table S2 also for the results of correlation and regression analyses.) Note that the 
results apply also to case (ii) local PWAs with due modifications 


is rooted at its sole internal node. The expected num¬ 
ber of indels per site along each branch (E[n ID \) is 
0.01 (small) for set 1A and 0.04 (medium) for set IB. 
Sets 3P, 3M and 3F consist of 10,000 MSAs each, 
which are simulated along the trees of 12 primates, 
15 mammals and 9 fast-evolving mammals, respect¬ 
ively (Additional file 1: Figure S9). These sets were 
designed to mimic typically encountered MSAs of se¬ 
lectively neutral genome sequences whose sequence 
divergences are small, moderate and large, respectively. 9 

Every simulation started with a random DNA sequence 
that is 1000 bases long. For reasons of computational 
time, we excluded local MSAs containing gaps longer than 
100 bases. The numbers of subject local MSAs in sets 1A, 
IB, 3P, 3M and 3F were 2,676,332, 7,695,575, 397,455, 
935,553 and 984,321, respectively. Among them, 0.15 %, 


1.38 %, 0.12 %, 0.23 % and 0.49 %, respectively, exhibited 
non-parsimonious ancestral sequence states. (See SM-6 of 
Supplementary methods in Additional file 1 for how the 
MSAs were compared.) 

For each of sets 1A and IB, we compared the absolute 
occurrence frequency of each of the homology structures 
of local MSAs with its first approximate prediction 
(Fig. 6). (See SM-7 in Additional file 1 for how the 
analysis was performed.) The approximation was pretty 
good for both set 1A (Fig. 6a) and set IB (Fig. 6b), with 
correlation coefficients 0.9996 and 0.9975, respectively 
(Additional file 1: Table S3). The scatter plot for set IB 
(Fig. 6b) showed a thin downward projection from 
around the middle of the main diagonal, indicating the 
underestimation of the frequency. However, it disap¬ 
peared after removing the local MSAs in each of which 
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Fig. 4 Gap configurations of local MSAs examined in perturbation analyses, a The 3-OTU tree used in the perturbation analyses. A node (open 
circle) is labeled n,- (/'= 1,2 or 3) (external) or n Root (root). A branch is labeled bj (/' = 1,2 or 3). In the following, A L Dl denotes the number of sites that 
the sequence at node n, has between the PASs. b Local gap configuration in case (I). c Case (II) with A L m = A L D2 = 5. d Case (III) with A L m =5. 
e Case (IV) with A L m = 5 and A L° 2 = 3. The figure is a modified version of Figure 4 of [43] 

V_I_ 


one or more unobservable indels are expected (Fig. 6c). 
This extends the conclusions in sections R2 and R3 that 
the first approximation estimates the probability of a 
local PWA fairly well as long as its size is within a 
threshold (( AL)§ r 5 p) or (AL)q I). 

Then, for each of simulated sets 1A, IB, 3P, 3M and 
3F, we first examined the relative frequencies of actual 
occurrences among different sets of parsimonious ances¬ 
tral states consistent with each local MSA. Then we put 
the relative frequencies into 20 bins of width 0.05 each. 
And, finally, we compared the average frequencies in the 


bins with their first-approximate predictions (Fig. 7 and 
Additional file 1: Figure S10). The predictions were 
shown to estimate the actual relative frequencies quite 
well, with the correlation coefficients ranging from 0.997 
to 0.99999 (Additional file 1: Table S4). (See SM-8 in 
Additional file 1 for how this analysis was done.) 

The results in this section suggest that the first ap¬ 
proximation would work fairly well also for a majority of 
local MSAs we are likely to encounter, as long as the 
local MSA and the branches are at most moderately 
long. 10 


-©-0.01 indels/site -*-0.04 indels/site 0.1 indels/site 
-$“0.2 indels/site -B-0.4 indels/site 



Fig. 5 Perturbation analysis on local MSA probabilities in case (III). The graph shows the ratio of the contribution of the type (B) next-parsimonious 
history to that of the parsimonious history. The former history consists of a deletion along each of branches 2 and 3, and the latter consists of an 
insertion along branch I.The ratio is shown as a function of the length of the gapped segment (AL D1 , abscissa) and the expected number of indels per 
site along each branch (different curves). The figure is a reduced and modified version of Figure 5 of [43] 
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Table 2 Perturbation analysis on local MSA probabilities in case 
(IV) 


(A L°\ A L° 2 ) 

0.01 indels/ 
site 

0.04 indels/ 
site 

0.1 indels/ 
site 

0.2 indels/ 
site 

(2, 1) 

0.004 

0.016 

0.037 

0.067 

(3, 1) 

0.016 

0.063 

0.150 

0.279 

(3,2) 

0.012 

0.049 

0.118 

0.225 

(10, 1) 

0.050 

0.190 

0.432 

0.751 

(10,2) 

0.060 

0.225 

0.509 

0.895 

(10,8) 

0.060 

0.231 

0.543 

0.980 

(10,9) 

0.047 

0.182 

0.421 

0.740 

(30, 1) 

0.140 

0.487 

0.915 

1.203 

(30, 5) 

0.163 

0.563 

1.118 

1.645 

(30, 25) 

0.173 

0.620 

1.251 

1.850 

(30, 29) 

0.139 

0.486 

0.909 

1.189 

(100, 1) 

0.418 

0.985 

1.180 

1.304 

(100, 99) 

0.418 

0.981 

1.170 

1.290 


Each cell shows the ratio of the total next-parsimonious contribution to the 
total parsimonious contribution. In each local MSA, the first, second and 
third sequences have A L m sites, A L° 2 (<A L m ) sites and zero site, respectively, 
in between the PASs. Each column is labeled with the expected number of 
indels per site along each of the three branches. The ratios that are less than 
0.5 are shown in boldface. This table is identical to Table 3 of [43]. See section Ml 
of Methods for the parameter setting 

R6. Examining other indel models and methods in light of 
our formulation 

One of the major merits of our ab initio perturbative 
formulation is that it can be applied to considerably real¬ 
istic evolutionary models of indels [22]. Therefore, it will 
enable us to examine, e.g., the parameter ranges where 
other indel probabilistic models can well approximate 
the ab initio alignment probability under a fairly realistic 
evolutionary model. 

First we briefly study the goodness of approximation 
by the geometric indel length distribution, which most 
of commonly used indel models are based on. As in the 
previous sections, we use the power-law distribution of 
the indel length as a refer¬ 

ence. Then, we fitted the scaled geometric distribution, 
f SG (l; A, q) = A(l-q)q l ~ 1 , to the power-law under a 
least-square criterion. The best-fit parameters were A LS 
= 0.7125 and q LS = 0.3957. Calculating the ratio, R LS {1) = 
f SG (l; A ls , qLs)lfik(J)> for different indel lengths, we 
find, e.g., R LS ( 5) = 0.3168, R LS (7) = 0.08495 and £ LtS (13) = 
8.775 xlO -3 . The ratio decreases rapidly as the indel 
length increases. If, for example, we allow the ratio as 
small as 1/3, the geometric distribution is regarded as a 
decent approximation only for / < 4. With the 

indels with / > 5 account for about 30 % of all indels. 
These mean that, for 30 % of actually occurring indels, 
the geometric distribution substantially underestimates 
their frequencies even according to the above lenient 


criterion . This reconfirms the importance of using bio¬ 
logically realistic indel length distributions, as pointed 
out, e.g., in [23, 24]. 11 

Next, as an example of indel models that incorporate 
some biological realism, we investigate the HMM of 
Kim and Sinha [36]. Their HMM can accommodate 
power-law indel length distributions. However, similarly 
to most other HMMs and transducers, it cannot cor¬ 
rectly handle overlapping indels along the same branch, 
although it can handle overlapping indels along different 
branches. Another characteristic of their method is that 
it applies the same indel length distributions to all 
branches. The behaviors of the “exact” solutions (Fig. 3, 
Additional file 1: Table S2) indicate that their HMM 
could approximate the probabilities of local PWAs fairly 
well in cases (i), (ii) and (iii), as long as branch lengths 
are reasonable for phylogenetic analyses. Almost the 
same conclusions were drawn also from the analyses 
using up to next-parsimonious contributions in the per¬ 
turbation expansion, Eq. (3). (See SM-9 of Supplemen¬ 
tary methods in Additional file 1.) Regarding case (iv) 
local PWAs, however, our analysis indicated that their 
HMM could substantially underestimate the ab initio 
probabilities, especially when long indels are involved 
(Table 3). (How we performed this analysis is also de¬ 
scribed in SM-9.) This is because their HMM, like most 
HMMs and transducers, neglects 1/3 of the effects of 
non-overlapping indels (panels a, b and c of Additional 
file 1: Figure S3), as well as most effects of overlapping 
ones (panels d and e), that can yield each case (iv) local 
PWA. (Briefly, most HMMs neglect one of panels b and 
c. See SM-9 for details.) These results suggest that a po¬ 
tentially effective way to improve the accuracy of the 
HMM of Kim and Sinha [36] would be to modify the 
transition probabilities between a deletion-type block 
and an insertion-type block. This measure will enable to 
incorporate the effects of overlapping indels in case (iv). 

As exemplified above, our ab initio perturbative 
formulation provides other indel probabilistic models 
with a sound reference point, under which the models 
can be examined to improve their accuracy and evolu¬ 
tionary consistency. A related topic is the Chapman- 
Kolmogorov (CI<) equation, which must be satisfied by 
genuine stochastic evolutionary models. Unfortunately, 
most of the currently common indel probabilistic 
models violate the CI< equation (as argued, e.g., briefly 
in [41]). Because our perturbative formulation satisfies 
the CI< equation up to any desired order of the perturb¬ 
ation expansion (Appendix A3 of [49]), our formulation 
could also examine the effects of the violation of the CI< 
equation. For example, under a geometric indel length 
distribution, the effects become conspicuous only when 
the indel lengths exceed a “critical value” of 0(10), where 
the geometric distribution substantially underestimates 
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Fig. 6 Simulation analyses on absolute alignment frequencies. Each 
panel compares the predicted absolute frequency of each local 
homology structure (ordinate) against the number of times that it 
actually occurred in a simulated dataset (abscissa). The predicted 
absolute frequency was calculated using Eq. (SM-7.1) in Additional 
file 1. Note the logarithmic scaling for both axes, which tends to 
exaggerate sampling errors on the lower-left region in each panel. 
Panels a, b and c, respectively, show the results with the simulated 
sets 1A, IB and set IB after removing long gapped segments. The 
panels a-c are reformatted versions of panels A-C of Figure 29 of [48] 


the real indel frequencies. (See, e.g., subsection 2.3 of 
[43].) This seems to explain the results of past studies 
(e.g., [50, 51]), which did not detect remarkable effects of 
the violation of the CI< equation (or the effects of its 
cause, i.e., inadequately incorporating overlapping indels). 


R7. Outstanding issues and possible improvements, 
extensions and applications 

Here, we briefly discuss some outstanding issues and 
their possible solutions in the forms of methodological 
improvements and extensions, and also possible applica¬ 
tions of the (improved/extended) methods to practical 
problems. (For more details on most of these topics, see 
Discussion of [48].) 

R7.1. Possible improvements and extensions of our 
computational methods and algorithms 

In this study, we successfully constructed two integral 
equation systems to calculate “exact” multiplication 
factors for case (i), (ii) and (iii) local PWAs. For case 
(iv) local PWAs (Additional file 1: Figure Sid), we only 
provided methods to analytically calculate the total 
parsimonious (i.e., 2-indel) and total next-parsimonious 
(i.e., 3-indel) contributions. Although in principle we 
could calculate contributions from indel histories with 
more than 3 indels each, the question should be how 
we can do this within a reasonable amount of time. 
Even if we can construct an integral equation system 
for case (iv), it is expected to contain terms with 
complex gap configurations, and thus it would be difficult 
to solve it “exactly.” Therefore, a key for this case should 
be how we can reasonably quickly obtain approximate 
multiplication factors each of which estimates the exact 
factor more accurately than the summation over all parsi¬ 
monious and next-parsimonious contributions. 

The main purpose for the current algorithm to calculate 
the first-approximate probability of a given MSA (SM-5 in 
Additional file 1) was to see whether or not the first 
approximation works also for local MSAs of general types. 
This algorithm merely constitutes the first step toward an 
automatic application of our ab initio perturbative formu¬ 
lation. Consequently, the algorithm still has some rooms 
for improvements. For example, the algorithm could be 
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Fig. 7 Simulation analyses on relative frequencies among local indel 
histories. Each panel compares the predicted relative frequencies 
(ordinate) against the actual relative frequencies in simulations 
(abscissa). The relative frequencies are among parsimonious local 
indel histories that potentially yield the same local MSA. A blue 
diamond, a red 'X' and a black cross represent a bin of all 
parsimonious local indel histories, that of most likely (ML) 
parsimonious histories, and that of least likely (LL) parsimonious 
histories, respectively. Panels a, b and c show the results with sets 
1A, IB and 3M, respectively. See section SM-8 in Additional file 1 for 
how the analysis was performed. Panels a and b are reformatted 
versions of panels D and E, respectively, of Figure 29 of [48] 

V_____ 


very slow when applied to a local MSA containing a long 
gap. Roughly speaking, the length of time consumed by 
the algorithm applied to an input MSA is the summation 
of the lengths of the consumed time over all local MSAs 
in it. We estimated that the algorithm applied to each 
local MSA has the time complexity roughly greater than 
0(B2 NshG ). (See subsection Dl.l of [48] for details.) Here, 
B is the number of blocks of distinct gap patterns in the 
local MSA, and N shG is the number of short gaps that are 
spatially overlapping and phylogenetically neighboring the 
long gap. For example, if B - N S hg = 20, the time com¬ 
plexity is greater than 0(20 x 2 20 ) « O(10 7 ). B should be 
roughly on the order of N S hG- And N S hG is roughly ex¬ 
pected to be around E[NshgI ~ (E[nm i] + E[n ID2 \) x AL 
Here E[n im \ and E[n ID2 ] are the expected numbers of 
indels per site along the two neighboring branches of the 


Table 3 Comparison of Kim-Sinha's probability with ab initio 
probability for case (iv) local PWA 


(A L a , AL°) 

Ratio (= ref/KS) a 

Overlapping b 

(1, 1) 

1.667 

0.167 

(3, 1) 

1.883 

0.383 

(3, 3) 

2.449 

0.949 

(5, 5) 

3.325 

1.825 

(10, 1) 

2.165 

0.665 

(10, 10) 

5.572 

4.072 

(25, 1) 

2.355 

0.855 

(25, 4) 

4.714 

3.214 

(30, 10) 

8.300 

6.800 

(100, 1) 

2.561 

1.061 

(100, 3) 

4.896 

3.396 

(300, 1) 

2.659 

1.159 


Each row gives values for a local PWA with A L A ancestral sites and AE D 
descendant sites in between a pair of PASs. See section Ml of Methods for the 
parameter setting. When A t = X D , the ratio for ( AL a , AlP) = (7 q , L 2 ) is identical 
to that for ( AL a , AL°) = ( L 2 , C). Thus, we only showed the results for 
AL a > AL d . The ratios less than 2 are in boldface. This table is a modified 
version of Table 4 of [43] 

a The ratio of the ab initio probability to the corresponding probability given 
by the HMM of Kim and Sinha [36], in the limit where the time interval (or 
branch length) approaches zero 

b The portion of the ratio contributed by overlapping indels (such as those in 
panels d and e of Additional file 1: Figure S3) 
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branch where the indel resulted in the long gap. (These 
three branches share the node that accommodates the 
subsequence aligned with the long gap.) And A L is the size 
of the local MSA. For example, if we assume a rather 
large value, E[n ID1 ] = E[n ID2 \ = 0.4/8 = 0.05, we have 
E[N shG \ « 0.1 x AL. In this case, we expect N S hG ~ 20 
almost always when AL = 200. Thus, such local MSAs 
could virtually stop the current algorithm. 12 One way 
to quickly process a local MSA containing long gaps 
should be to treat the gaps hierarchically, first long gaps 
alone and second the remaining short and medium 
gaps (Additional file 1: Figure Sll; see Discussion Dl.l 
of [48] for more details). If this strategy indeed works, 
the 0(2 NshG ) component of the time complexity would 
reduce to 0(2A/ S / zG ), because the short gaps that are 
phylogenetically neighboring the long gap can be han¬ 
dled independently of one another (panel d of Figure 
Sll). Another very similar strategy should be to narrow 
down the ancestral sequence states to be searched for, 
regardless of the presence/absence of gapless columns , 
by exploiting the “phylogenetic correctness” condition 
(e.g., [46, 47]). The condition must always be satisfied 
by the ancestral sequence states consistent with MSAs, 
and thus it should be very powerful. 

Another possible improvement should be to incorpor¬ 
ate non-parsimonious indel histories so that we can en¬ 
hance the accuracy of the probability estimation. As in 
section R4, we can classify non-parsimonious histories 
into two broad categories: (A) those each of which 
shares the set of all ancestral sequence states with a par¬ 
simonious history, and (B) those that share the set with 
no parsimonious history. Each non-parsimonious history 
in category (A) yields the same ancestor-descendant 
PWAs along all branches as a parsimonious history does 
(section R7 of [22]). Hence, we could easily incorporate 
the effects of category (A) histories by using local PWA 
multiplication factors that take account of non- 
parsimonious contributions, as we calculated in sections 
R2 and R3. As the result in section R4 (Fig. 5) indicates, 
this could considerably improve the accuracy relatively 
easily. Incorporating histories in category (B) should re¬ 
quire an algorithm to systematically enumerate such his¬ 
tories. Some hints may come from the examples in 
section R4 and SM-4 of Supplementary methods in Add¬ 
itional file 1, and the “branch-and-merge” operation 
(SM-5.2 of Supplementary methods). The real challenge, 
though, should be to devise a method to enumerate such 
histories efficiently. 13 

In this paper, we presented the results of computing 
local alignment probabilities (or multiplication factors) 
under a space-homogeneous model implemented in 
Dawg [32], mainly in order to avoid excessive presenta¬ 
tional complications. At least theoretically, however, the 


computational methods (in Additional file 1) could be 
extended to space-heterogeneous situations relatively eas¬ 
ily. All we have to do is substitute space-heterogeneous 
counterparts for the space-homogeneous indel rates (and 
exit rates) in the final formulas in SM-2, SM-3 and SM-4 
(in Additional file 1), and replace multiplication by some 
integer factors (such as (AL a + 1) in Eq. (SM-3.2) and Eq. 
(SM-3.4b) in Additional file 1) with summation over pos¬ 
sible positions (of a position-flexible indel event). The time 
integration can be performed analytically (except in SM-3) 
as long as the model is time-homogeneous. And the com¬ 
putation could be automated as long as the indel rates are 
specified according to some programmable rules. In some 
cases, tricks or approximations may be necessary so that 
the computation (involving the aforementioned summa¬ 
tions) can be finished quickly enough. It should be kept in 
mind, however, that such computation will make sense 
only if the probabilities of the entire alignments are factor¬ 
able. This means that the indel rates (and the exit rates) 
must satisfy the conditions (i) and (ii) (and (iii) for MSAs) 
explained in section Rl, which may bring in some compli¬ 
cations. For example, in the most general indel model we 
currently know to have factorable alignment probabilities 
(described in subsection R8-3 of [22]), each locally hetero¬ 
geneous set of indel rates is confined in a region that does 
not necessarily coincide perfectly with a gapped segment 
(i.e., a local alignment). When the region accommodates 
only one gapped segment (and some gapless columns), 
there should be no serious problem; although each pos¬ 
ition between contiguous gapless columns may experience 
some indels, the effects of such indels should be negligible 
(as shown in sections R2 and R3), allowing us to focus on 
the single gapped segment. On the other hand, a serious 
problem may arise when the region accommodates two or 
more gapped segments. In this case, the contributions 
from the gapped segments (overlapping the region) can 
no longer be factorized, and thus all possible relative or¬ 
ders will have to be considered among indels in different 
gapped segments overlapping the region (while retaining 
the order in each segment). This could substantially slow 
down the computation, especially regarding non- 
parsimonious indel histories (including practically exact 
solutions), and some new measures may be necessary for 
reasonably fast computation. In addition, it should be re¬ 
membered that, in order to pursue further biological real¬ 
ism, one must also overcome some other hurdles, such as 
more realistic boundary conditions and mutations other 
than indels and substitutions (discussed, e.g., in section R9 
of [22] and [37]). 

R7.2. Risks associated with naive applications to 
reconstructed alignments 

Some readers may consider conducting some evolu¬ 
tionary analyses by applying the algorithm presented 
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here to a MSA reconstructed by one of the state-of- 
the-art aligners (reviewed, e.g., in [7]). We strongly cau¬ 
tion the readers that it would be premature to conduct 
such analyses at this point. What we demonstrated here 
is that the algorithm estimates the probabilities quite 
accurately, provided that it is fed a correct MSA . Unfor¬ 
tunately, however, recent analyses (e.g., [38, 52, 53]) 
showed that reconstructed MSAs are considerably 
error-prone, even if they were reconstructed via state- 
of-the-art aligners. Thus, a naive application of the 
algorithm to a reconstructed MSA would likely lead to 
incorrect predictions. Therefore, the readers should 
avoid such analyses whenever possible. Even if they 
need to perform such analyses, the possibility of MSA 
errors must be fully taken into account when interpret¬ 
ing the results. 

R7.3. Possible applications 

Originally, we developed our theoretical formulation 
[22] and the algorithm presented here for the purpose of 
comparing candidate MSAs in terms of their occurrence 
probabilities, i.e., their likelihoods. This purpose should 
be adequately fulfilled considering the accuracy of the 
predicted probabilities under moderate conditions, as 
demonstrated in this paper. If the algorithm can be 
coupled with a sampler that can preferentially explore 
quite likely regions of the MSA space, we could obtain 
an approximate probability distribution of MSAs. Such a 
distribution should be very useful, because a substantial 
fraction of alignment errors turned out to be due to the 
stochastic nature of evolutionary processes [38]. In the 
previous study [22], we showed that the “long indel” 
model [26] is virtually equivalent to our ab initio formula¬ 
tion under space-homogeneous indel rates. Hence, their 
dynamic programming (DP) could be applicable to the 
problem, possibly with some modifications. Although the 
full version of their DP is quite slow, a device similar to 
those used recently (e.g., [21, 41, 54]) might speed up the 
MSA space exploration. It remains to be seen if such a 
device could be successfully adapted to our formulation or 
not. Most of currently available MSA aligners, whether 
they implement probabilistic or single-optimum-search 
algorithms, are based on geometric distributions. Because 
biologically realistic indel length distributions were shown 
to improve the accuracy of pairwise sequence compari¬ 
sons (e.g., [23, 24]), we expect that this will be the case 
also with multiple sequence comparisons. (This expect¬ 
ation was partially confirmed in [38].) 

Up to here, we assumed that the phylogenetic tree is 
given. In many cases, however, the phylogenetic trees 
must also be inferred from the input sequence data. A 
theoretically ideal way would be to infer the joint distri¬ 
bution of MSAs and phylogenetic trees, as it is expected 
to minimize possible prediction biases (e.g., [18, 39, 55, 


56]). A major problem is that such an analysis would be 
very time-consuming in general. In this sense, the trad¬ 
itional method of inferring the phylogenetic tree from an 
input MSA (e.g., [5]) and the incorporation of indel in¬ 
formation into the method (e.g., [57]) would still be use¬ 
ful. When trying to adapt our algorithm or formulation 
to any of these methods, we will have to further speed 
up the calculation of approximate alignment probabil¬ 
ities, especially under the moves in the tree space ex¬ 
ploration, such as the nearest-neighbor interchange 
(NNI) and the sub-tree pruning and re-grafting (SPR). 
At present, it is a totally open question whether we can 
really do this without compromising the accuracy of the 
predicted probabilities. This should be a challenging, 
formidable yet crucial problem of phylogenetic-level 
molecular evolution. 

Conclusions 

In the previous study [22], we proposed a theoretical 
formulation that facilitates the ab initio calculation of 
the probabilities of given PWAs and MSAs under the 
general continuous-time Markov model, which describes 
the evolution of an entire sequence along a time axis via 
indels. And we explicitly demonstrated that, under a cer¬ 
tain set of conditions, each ab initio alignment probabil¬ 
ity is factorable into the product of an overall factor and 
multiplication factors originated from local alignments 
delimited by preserved ancestral sites, thus providing a 
sort of generalized HMMs (Eq. (2) and Eq. (5)). 

In this study (especially in Supplementary methods in 
Additional file 1), we provided some methods and an 
algorithm to concretely calculate the total parsimonious 
contribution and the total next-parsimonious contribu¬ 
tion to the multiplication factor, Eq. (3) or Eq. (6), 
originated from each local alignment, under space- 
homogeneous situations for illustration purposes. Our 
analyses indicated that even the total parsimonious 
contribution approximates the multiplication factor 
fairly well as long as (A/ + A jD )(^ f - tj)AL is within an 
0(1) threshold. Here, (A/ + A jD )(^- £ 7 ) is the expected 
number of indels per site and A L is the local alignment 
size. Moreover, again under space-homogeneous situa¬ 
tions, we deduced two systems of integral equations 
that can be numerically solved to give practically exact 
multiplication factors for local PWAs of cases (i), (ii) 
and (iii). An inspection of the practically exact factors 
indicated that the finite-time transition probabilities in 
these local PWAs keep following the power-law, and 
that the exponent only slightly deviates from the ori¬ 
ginal exponent for the instantaneous indel length distri¬ 
bution. Equipped with these results and new methods, 
the theoretical formulation we proposed in [22] is ex¬ 
pected to provide other indel probabilistic models with 
a sound reference point, which could suggest necessary 
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modifications to improve the accuracy of the models 
(as exemplified in section R6). 

However, considering that the commonly used aligners 
are considerably error-prone (as shown e.g., in [38, 52, 
53]), it would be very risky to naively apply the pre¬ 
sented algorithm or methods to reconstructed MSAs. 
Thus, it should be preferable to first develop programs 
that exploit the fruits of the previous study [22] and this 
study to accurately estimate the uncertainties in, and 
rectify the errors of, reconstructed alignments under a 
genuine stochastic model of sequence evolution via 
indels that is biologically more realistic than almost all 
models studied in the past. 

Methods 

Ml. Parameter settings for numerical analyses 

We performed all numerical analyses in this paper using 
the space-homogeneous indel model implemented in 
Dawg [32] (see Eqs.(7-9)). Unless otherwise stated, the 
total insertion rate was set equal to the total deletion rate 
(that is, A/ = X D ), according to a genome-wide data 
analysis [41]. We used the power-law indel length 
distribution for both insertions and deletions: 

m = f D (i) = /f 6 (/; l co ) = r 16 /(Y,Z k ~ 16 )- 

The power-law exponent of 1.6 is among the typical values 
observed empirically (e.g., [24] and references therein). 
The cut-off indel length, L co , was set at 500 sites for the 
perturbation analyses to assess the goodness of the first 
approximation (Tables 1 and 2, Figs. 2, 3 and 5; Additional 
file 1: Tables SI, S2 and Figure S4), whose results were al¬ 
most independent of L co . It was set at 5000 sites when 
assessing the goodness of approximation by the HMM of 
Kim and Sinha [36] (Table 3), because the result stabilized 
around this value of L co . In the perturbation analyses on 
local MSAs (e.g., Table 2 and Fig. 5), we used a 3-OTU 
tree with equally long branches (Fig. 4a). The tree was 
rooted at its sole internal node. For the iterative perturb¬ 
ation analysis (Figs. 2 and 3; Additional file 1: Tables SI 
and S2), the sub-time-interval for the numerical time 
integration was set at E[n 1D \/N P = 0.001 (indels per 
site). 14 For local MSA analyses (both perturbative 
and simulation-based), the uniform sequence length 
distribution was employed as the prior probability of 
the root sequence state. 

M2. Simulations to prepare input MSA sets 

To validate the entire algorithm described in section 
SM-5 of Supplementary methods in Additional file 1, 
we prepared five sets of MSAs using a genuine se¬ 
quence evolution simulator, Dawg [32]. We performed 
all simulations using the same Zipf power-law distribu- 

tion, f D (/) = fj (/) = r Y / (y^ =1 £“ r J , with the 


exponent y = 1.6 and the cut-off indel length of L co (= 
Lf° = Ln°) = 100 bases. The exponent y = 1.6 is typical 
among empirically observed values (e.g., [24] and refer¬ 
ences therein). The cut-off length was chosen in order 
to prevent it from taking extremely long to search for 
parsimonious local indel histories. This is because, with 
our current implementation, the search could be very 
time-consuming when a gapped segment contains at 
least one long gap (see subsection R7.1 of Results and 
discussion for a possible solution). Each of the simula¬ 
tions started with a random ancestral DNA sequence 
that is 1000 bases long. In each simulation, we labeled 
all the internal nodes of the input tree, in order to keep 
the ancestral sequences aligned with the “extant” se¬ 
quences (at the external nodes). Other parameters and 
options were set at default values unless otherwise 
stated. We created five input MSA sets, namely, 1A, 
IB, 3P, 3M and 3F. 

Set 1A consists of 100,000 MSAs, each of which was 
simulated along a 3-taxon tree starting at a root with 
three child branches. The lengths of the three branches 
were all set at 0.05 (substitutions per base). The total 
rates of insertions and deletions were set at A/ = X D = 0.1 
(per expected substitution), which are close to the 
upper-bounds for neutrally evolving mammalian DNA 
sequences [24, 35]. 

Set IB is nearly the same as Set 1A, expect that all 
branch lengths were set at 0.2 (substitutions per base). 

We prepared 1A and IB, because validating the theoret¬ 
ically predicted occurrence probabilities of local homology 
structures necessitated a large number of MSAs simulated 
under identical parameter settings. 

Each of Sets 3P, 3M and 3F consists of 10,000 MSAs. 
The settings for these three sets differ only in the phylo¬ 
genetic tree used for the simulations. The MSAs in these 
sets were simulated along the tree of 12 primates (panel 
a of Additional file 1: Figure S9), the tree of 15 mammals 
(panel b) and the tree of 9 fast-evolving mammals (panel 
c), respectively. These three sets, 3P, 3M and 3F, were 
intended to mimic typically encountered MSAs among 
selectively neutral DNA sequences with small, moderate 
and large sequence divergences, respectively. The total 
indel rates for these three sets were set at A/ = \ D = 1/ 
16 = 0.0625 (per expected substitution), according to 
genome-wide data analyses [35, 41]. For more details on 
these three sets, see [38]. 

The Dawg control files used to generate these simu¬ 
lated datasets, including the phylogenetic trees and indel 
model parameters, are available as a part of Additional 
file 2. 

Before the analyses, all simulated MSAs were pre- 
processed so that the MSAs with an identical homology 
structure will be replaced with a unique representative 
MSA. See Methods of [38] for details. 
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M3. Program implementation 

The Perl modules and main Perl scripts used in this 
study are available (under the GNU General Public 
License) as a package named “LOLIPOG” (for “LOg- 
Likelihood for the Pattern Of Gaps (in MSA)”) (version: 
“FA_LOLIPOG_P.ver0.6.1.6”), which is archived in 
Additional file 2. The latest version of the package will 
be available in the “lolipog” directory at the FTP reposi¬ 
tory of http://Bioinformatics.Org [58]. 

Endnotes 

1 An ancestry index (or an “ancestry” for short) is 
assigned to each site, in order to distinguish between the 
sites of different evolutionary origins. Once an ancestry 
is assigned to a site, it will not change throughout the 
evolutionary history. Sharing of the same ancestry between 
the sites of different sequences indicates that the sites are 
mutually homologous (more precisely, orthologous). See 
section R2 of [22] for more details. The ancestry indices 
also help realize the rate heterogeneity across regions 
(section R3 of [22]). 

2 In this study, the homology structures we consider 
will usually be among extant sequences, i.e., excluding 
ancestral sequences. When dealing with each particular 
indel history, however, ancestral sequences will also be 
included. 

3 This “perturbative” calculation of a multiplication fac¬ 
tor may look similar to the calculation of a “chop-zone” 
probability proposed by Miklos et al. [26] under their 
“long indel” model. In fact, our “perturbative” calculation 
could be regarded as an (modified) extension of their 
calculation method to somewhat more general evolu¬ 
tionary models. (See section SA-3 of Additional file 2 of 
[22] for their mutual equivalence with caveats.) 

4 For example, the results under the “long indel” model 
[26] are essentially the same as those obtained here, if 
the former is freed from the time-reversibility. 

5 It should be noted that this setting is just a represen¬ 
tative of seemingly common situations. For example, the 
empirically estimated power-law exponents vary between 
near 1 and near 2, although the values within 1.4-1.8 are 
quite common ([24] and references therein), regardless 
of whether the sequences are genomic or protein-coding. 
In the past, the deletion rate was often observed as greater 
than the insertion rate (e.g., [59]). But a simulation study 
[52] indicated that such observations are likely due to a 
bias intrinsic to the similarity-based aligners (as classified 
in [60]). And a recent analysis of mammalian sequences 
via an evolution-based probabilistic aligner (expected to 
be devoid of the above bias) indicated that the deletion 
rate is nearly equal to the insertion rate [41]. It remains to 
be seen whether this rate equality applies also to the taxa 
other than mammals. Regarding the sequence length 
distribution, we do not know any empirical results. 


Although we believe that our choice of the uniform dis¬ 
tribution should be theoretically very reasonable for 
neutrally evolving regions sampled randomly from long 
genomes/chromosomes (see endnote 14 of [22]), distri¬ 
butions of regions under selection may show quite 
different behaviors. It should be interesting, and im¬ 
portant, to examine how the results in this paper will 
change in response to the deviation of the setting from 
that given here. 

6 These cases correspond to the “chop-zones” having 
probabilities Ny [26], with: (i) i=j = 0; (ii) i >0, j = 0; (iii) 
i - 0, j> 0; and (iv) i >0, j > 0. 

7 Because of the symmetry between the finite-time 
transition probabilities under the time reversal, the latter 
results become equal to the former when the insertion 
and deletion parameters are swapped. 

8 The analytical expressions are somewhat complex in 
case (IV). See Appendix A2 of [43] for details. 

9 On top of them, we also prepared yet another set, 
Set 2, created by simulations under 33 sets of parame¬ 
ters typical of the structure-based benchmark MSAs. 
(See section M2.2 of Methods and Figure 28, both of 
[48], for details on this input MSA set.) The validation 
analyses on this Set 2 gave nearly as good results as 
those on the 5 sets described here. (See Results of [48] 
for details.) 

10 Incidentally, for the analysis on relative probabilities, 
we only used local MSAs each of which can result from 
two or more parsimonious local indel histories. The in¬ 
stances of such gapped segments accounted for 4.5 %, 
12.0 %, 17.5 %, 19.2 % and 14.2 % of all “parsimonious” 
instances in sets 1A, IB, 3P, 3 M and 3 F, respectively. 
Out of them, the most likely (ML) histories were wrong 
in 42.3 %, 43.4 %, 36.1 %, 13.1 % and 38.4 % of the in¬ 
stances in the respective sets. Therefore, even if we make 
an unlikely assumption that the aforementioned “non- 
parsimonious” instances were all due to the ML histories 
that are non-parsimonious, any algorithm that searches 
for a single ML history would have overlooked the true 
indel history in 1.9 %, 5.2 %, 6.3 %, 2.5 % and 5.4 % of 
the cases in the respective sets. These frequencies are 
much larger than those of the “non-parsimonious” in¬ 
stances, i.e., 0.15 %, 1.4 %, 0.12 %, 0.23 % and 0.49 %, re¬ 
spectively. This indicates that, given correct MSAs and 
correct trees, our algorithm can recover the true indel 
histories more frequently than any algorithm to search 
for a single ML history. 

n Mixed geometric distributions could decently ap¬ 
proximate a power-law distribution in wider ranges. For 
example, the approximation by a mixed distribution with 
two geometric components (used, e.g., in [23]) is fairly 
good up to a few dozen residues, but it gets poor when 
indels become as long as hundreds of residues. To de¬ 
cently approximate the realistic distributions of such 
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long indels, more geometric components would be 
necessary. 

12 In SM-8 of Supplementary methods of [38], we 
roughly estimated the expected mean length (L loc(T)) 
of a correct (i.e., not reconstructed) local MSA as: 
Lio C {T) » [exp(\T\ X D 1 D ) + exp(\T\ X 1 ~lj) -2]/ 
[l-exp(-\T\ (X D + A/))] . Here, \T\ is the total 
branch length across the tree ( T) (in units of the ex¬ 
pected number of substitutions per site). X D and X If 
respectively, denote the deletion rate and the insertion 
rate, both per expected substitution. And Id and 7/ are the 
average lengths of a deleted subsequence and an 
inserted subsequence, respectively. Under the param¬ 
eter setting for the simulation analyses of this study, we 
have ~l D = h = 6.35. If we set X D = A/ = 1/16 = 
0.0625 according to [35, 41], then, Li 0 C (T) exceeds 100 
when \T\ is greater than 9, which is quite large. Thus, 
as long as we are dealing with correct MSAs , our first- 
approximate algorithm is expected to work on a major¬ 
ity of local MSAs until | T | becomes this large. It should 
be noted, however, that reconstructed MSAs could be 
seriously erroneous even if | T\ is, e.g., less than 3 [38]. 

13 Strictly speaking, the current “local-multi-path down¬ 
hill search” algorithm is not perfect, in the sense that it 
misses some deletion-dominated parsimonious local his¬ 
tories that can yield gap configurations belonging to the 
class of “intersection between cousins” described in [61]. 
Fortunately, this drawback is not expected to be so ser¬ 
ious, because such local histories require at least four 
indels each and thus should be very rare. 

14 We calculated the multiplication factors, fi^p ID ~ 200 ^ 
[case (Hi) ; AlP\ [; tj , t]] with t<E [t h t F \, using E[n ID \l 
N P = 0.001, and compared them to those calculated 
using E[n ID ]/N P = 0.0005. In each pair of values of A L D 
and £, the difference was within 0.2 % of the multiplication 
factor itself. Thus we concluded that they are virtually 
exact. 

Additional files 


Additional file 1: Supplementary methods (sections SM-1 through SM-9), 
Tables SI through S4, and Figures SI through S11. The sections of 
Supplementary methods describe methodological details on the 
(analytical and algorithmic) computations of local alignment probabilities 
(SM-1,2,3,4,5,9) and on the simulation analyses for validating the algorithm 
(SM-6,7,8). Tables S1-S4 show the results of some analyses. Figures SI, S2, S3 
and S11 illustrate some important concepts. Figures S4 and S10 show the 
results of some analyses. Figures S5, S6, S7 and S8 explain the algorithm 
implemented in LOLIPOG. Figure S9 shows the phylogenetic trees used for 
the simulations. (PDF 12355 kb) 

Additional file 2: A package (the version used for the analyses in this 
paper), which contains Perl modules that implement the key algorithms 
and formulas, as well as some main Perl scripts that we used for the 
actual data analyses. (The package is available under the GNU General 
Public License. The modules and scripts will run on a Mac OS X terminal, 
and were confirmed to run also on Red Flad Enterprise Linux (6.4). And 


they will probably run on other UNIX platforms as well, although we have 
not tested whether they indeed do.) The package also contains Dawg 
control files necessary for creating the simulated MSA sets 1 A, IB, 2, 3P, 3M 
and 3F. The latest version of the package ("FA_LOLIPOG_P.verxxx") will 
be available from the "LOLIPOG" (LOg-LIkelihood for the Pattern Of 
Gaps) project at the FTP repository of the Bioinformatics Organization 
[58], (ZIP 7548 kb) 
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